The data contains features extracted from the silhouettes of vehicles viewed from different angles. Four "Corgi" model vehicles were used for the experiment: a double-decker bus, a Chevrolet van, a Saab 9000 and an Opel Manta 400. This particular combination was chosen with the expectation that the bus, the van and either one of the cars would be readily distinguishable, but that it would be more difficult to distinguish between the two cars.
Object recognition
The purpose is to classify a given silhouette as one of three types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.
● All the features are geometric features extracted from the silhouette.
● All are numeric in nature.
Apply a dimensionality reduction technique (PCA) and train a model using principal components instead of training the model on the raw features alone.
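As a quick sketch of the intended workflow, PCA can be fitted on standardized features and the resulting components fed straight to a classifier. The snippet below uses synthetic stand-in data from `make_classification` (since `vehicle-1.csv` is only loaded later) and an illustrative choice of 7 components; it is a minimal sketch, not the final model built below.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the vehicle features: 18 numeric columns, 3 classes
X_demo, y_demo = make_classification(n_samples=300, n_features=18,
                                     n_informative=8, n_classes=3,
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Scale -> project onto principal components -> classify
pipe = make_pipeline(StandardScaler(), PCA(n_components=7), SVC())
pipe.fit(X_tr, y_tr)
print(pipe.score(X_te, y_te))
```

Using a `Pipeline` keeps the scaler and PCA fitted on training data only, which avoids leaking test-set statistics into the projection.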
import numpy as np # array handling
import pandas as pd # dataframe handling
import seaborn as sns # plotting
sns.set(color_codes=True)
import matplotlib.pyplot as plt # plotting
%matplotlib inline
# For preprocessing the data
from sklearn import preprocessing
# To split the dataset into train and test datasets
from sklearn.model_selection import train_test_split
from scipy import stats
from sklearn import metrics
# To model the SVM
from sklearn import svm
#importing the Encoding library
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report,confusion_matrix
# To calculate the accuracy score of the model
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings('ignore')
#KFold cross validation
from sklearn.model_selection import KFold
#Build the model with the best hyper parameters
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
#Importing PCA for dimensionality reduction and visualization
from sklearn.decomposition import PCA
# Import Support Vector Classifier machine learning library
from sklearn.svm import SVC
# Reading the data as dataframe and print the first five rows
data = pd.read_csv('vehicle-1.csv')
data.head()
data.columns
data.shape #no of rows and columns in the dataframe
There are 846 rows and 19 columns in the DataFrame.
data.dtypes # to get the data type of each attributes
All the input features are numeric (integer or float). Only "class" is of type object, as it is categorical in nature.
data.isnull().sum()
There are missing values in the majority of the columns.
# Split the dataset according to their class types
unique_vehicles = [data[data['class'] == veh] for veh in data['class'].unique()]

# Replace the NULLs with the median of the respective feature within each class
for unique_veh in unique_vehicles:
    for col in unique_veh.columns[:-1]:
        median = unique_veh[col].median()
        unique_veh[col] = unique_veh[col].fillna(median)

# Join the split datasets back together and sort the index
data = pd.concat(unique_vehicles).sort_index()
data.isnull().sum()
data.describe().transpose()
summary=data.describe().T
summary[['min', '25%', '50%', '75%', 'max']]
data.skew(numeric_only = True)
Positive skewness indicates a distribution with a longer tail on the right; negative skewness indicates a longer tail on the left.
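A small illustration of this on synthetic data (an exponential sample for a right-skewed shape, a normal sample for a symmetric one):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
right_skewed = pd.Series(rng.exponential(scale=2.0, size=1000))  # long right tail
symmetric = pd.Series(rng.normal(size=1000))                     # roughly symmetric

print(right_skewed.skew())  # clearly positive
print(symmetric.skew())     # close to zero
```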
# A quick check to find columns that contain outliers
fig = plt.figure(figsize = (15, 7.2))
ax = sns.boxplot(data = data.iloc[:, 0:18], orient = 'h')
From the boxplots above we can see that many of the columns contain outliers.
# Boxplot, distribution plot and histogram for each numeric feature
for col in data.columns[:-1]:
    plt.figure(figsize=(20, 6))
    # boxplot
    plt.subplot(1, 3, 1)
    sns.boxplot(data[col], showfliers=True, color='c').set_title("Distribution of '%s'" % col)
    # distplot
    plt.subplot(1, 3, 2)
    sns.distplot(data[col], color='m').set_title("%s Vs Frequency" % col)
    # histogram plot
    plt.subplot(1, 3, 3)
    data[col].plot.hist(color='g').set_title("%s Vs Frequency" % col)
    plt.show()
# Plot the central tendency of the dataset
_, bp = data.boxplot(return_type='both', figsize=(20,10), rot='vertical')
fliers = [flier.get_ydata() for flier in bp["fliers"]]
boxes = [box.get_ydata() for box in bp["boxes"]]
caps = [cap.get_ydata() for cap in bp["caps"]]
whiskers = [whisker.get_ydata() for whisker in bp["whiskers"]]

# Count the number of outlier data points present in each feature
for idx, col in enumerate(data.columns[:-1]):
    print(col, '--', len(fliers[idx]))
There are 8 columns in the dataset that contain outliers.
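The treatment below caps points falling outside the Tukey fences (Q1 − 1.5·IQR, Q3 + 1.5·IQR). A minimal illustration of the same IQR rule on a toy series:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 95])  # 95 is an obvious outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
capped = s.clip(lower=low, upper=high)  # values outside the fences are pulled to them
print(high, capped.max())
```

Capping (rather than dropping) outliers keeps the row count intact, which matters for a dataset of only 846 samples.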
# Treat the outlier data points: cap values outside the IQR fences at the whisker caps
for idx, col in enumerate(data.columns[:-1]):
    q1 = data[col].quantile(0.25)
    q3 = data[col].quantile(0.75)
    low = q1 - 1.5 * (q3 - q1)
    high = q3 + 1.5 * (q3 - q1)
    # caps come in pairs per box: lower cap at idx*2, upper cap at idx*2 + 1
    data.loc[(data[col] < low), col] = caps[idx * 2][0]
    data.loc[(data[col] > high), col] = caps[idx * 2 + 1][0]
# Check the dataset after Outlier treatment
fig = plt.figure(figsize = (15, 7.2))
ax = sns.boxplot(data = data.iloc[:, 0:18], orient = 'h')
class_counts = pd.DataFrame(data["class"].value_counts()).reset_index()
class_counts.columns =["Labels","class"]
class_counts
# Check the frequency distribution of each target class
fig, axes = plt.subplots(1, 2, figsize=(16,6))
sns.countplot(data["class"], ax=axes[0], palette='rocket')
_ = axes[1].pie(data["class"].value_counts(), autopct='%1.1f%%', shadow=True, startangle=90, labels=data["class"].value_counts().index)
# Compare class wise mean
pd.pivot_table(data, index='class', aggfunc=['mean']).T
for i in data:
    if i != 'class':
        sns.catplot(x='class', y=i, kind='box', data=data)
#Encoding of categorical variable
labelencoder_X=LabelEncoder()
data['class']=labelencoder_X.fit_transform(data['class'])
sns.pairplot(data,diag_kind='kde',hue='class');
plt.figure(figsize=(25, 25))
ax = sns.heatmap(data.corr(), vmax=.8, square=True, fmt='.2f', annot=True, linecolor='white', linewidths=0.01)
plt.title('Correlation of Attributes')
plt.show()
● compactness is positively associated with circularity, distance_circularity, radius_ratio, scatter_ratio, pr.axis_rectangularity, max.length_rectangularity, scaled_variance and scaled_variance.1, and negatively associated with elongatedness.
● circularity is positively associated with distance_circularity, scatter_ratio, pr.axis_rectangularity, max.length_rectangularity, scaled_variance, scaled_variance.1 and scaled_radius_of_gyration, and negatively associated with elongatedness.
● distance_circularity is positively associated with radius_ratio, scatter_ratio, pr.axis_rectangularity, max.length_rectangularity, scaled_variance, scaled_variance.1 and scaled_radius_of_gyration, and negatively associated with elongatedness.
● radius_ratio is positively associated with pr.axis_aspect_ratio, scatter_ratio, scaled_variance, scaled_variance.1 and scaled_radius_of_gyration, and negatively associated with elongatedness.
● pr.axis_aspect_ratio is positively associated with radius_ratio.
● skewness_about and skewness_about.1 show little correlation with the other features.
● scaled_radius_of_gyration.1 is negatively correlated with hollows_ratio and skewness_about.2.
● elongatedness is negatively associated with almost all columns.
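Rather than reading the strongest pairs off the heatmap by eye, they can be extracted programmatically. The helper below (`top_correlations` is a hypothetical name, not from the original notebook) keeps only the upper triangle of the correlation matrix so each pair appears once, demonstrated here on a small synthetic frame:

```python
import numpy as np
import pandas as pd

def top_correlations(df, n=5):
    """Return the n strongest absolute pairwise correlations, each pair once."""
    corr = df.corr().abs()
    # keep only the upper triangle so (a, b) is not also reported as (b, a)
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    return corr.where(mask).stack().sort_values(ascending=False).head(n)

# toy frame: 'a' and 'b' are strongly related, 'c' is independent noise
rng = np.random.default_rng(1)
a = rng.normal(size=200)
df = pd.DataFrame({'a': a,
                   'b': 2 * a + rng.normal(scale=0.1, size=200),
                   'c': rng.normal(size=200)})
print(top_correlations(df))
```

Applied to `data.drop('class', axis=1)`, the same helper would list the highly correlated feature pairs noted above, which motivates dropping redundant columns later.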
X=data.drop('class', axis=1)
y=data[['class']]
X.head()
# Feature Importance plot using Random Forest Classifier
rf = RandomForestClassifier().fit(X, y.values.ravel())
pd.DataFrame(rf.feature_importances_, index = data.columns[:-1],
columns=['Importance']).sort_values('Importance').plot(kind='barh', figsize=(15,7), title='Feature Importance')
X=X.drop(['scatter_ratio','pr.axis_rectangularity','scaled_variance.1', 'scaled_variance'], axis=1)
X.info()
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 10)
from sklearn.preprocessing import StandardScaler
sc= StandardScaler()
X_train=sc.fit_transform(X_train)
X_test=sc.transform(X_test)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
param_grid = [{'kernel': ['rbf'], 'gamma': [1, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5],
               'C': [0.01, 0.05, 0.5, 1, 10, 25, 50, 100]},
              {'kernel': ['linear'], 'C': [0.001, 0.01, 0.1, 10]}]
grid = GridSearchCV(SVC(), param_grid, refit = True)
# fitting the model for grid search
grid.fit(X_train, y_train)
# print best parameter after tuning
print(grid.best_params_)
# print how our model looks after hyper-parameter tuning
print(grid.best_estimator_)
# With the best hyper parameters found as given above, the final model can be built as given below
clf=SVC(kernel='rbf', C=25, gamma=0.01)
clf.fit(X_train, y_train)
prediction = clf.predict(X_test)
# check the accuracy on the training data
print('Accuracy on Training data: ',clf.score(X_train, y_train))
# check the accuracy on the testing data
print('Accuracy on Testing data: ', clf.score(X_test , y_test))
#Calculate the recall value
print('Recall value: ',metrics.recall_score(y_test, prediction, average='macro'))
#Calculate the precision value
print('Precision value: ',metrics.precision_score(y_test, prediction, average='macro'))
print("Classification Report:\n", metrics.classification_report(y_test, prediction))
# Confusion matrix over all three encoded classes, labelled with the original class names
cm = metrics.confusion_matrix(y_test, prediction)
df_cm = pd.DataFrame(cm, index=labelencoder_X.classes_,
                     columns=labelencoder_X.classes_)
plt.figure(figsize=(7, 5))
sns.heatmap(df_cm, annot=True, fmt="d")
#Store the accuracy results for each kernel in a dataframe for final comparison
resultsDf = pd.DataFrame({'Model': ['SVM - Hyper Parameter tuned(without PCA)'],
                          'Accuracy': clf.score(X_test, y_test)}, index=['2'])
resultsDf = resultsDf[['Model','Accuracy']]
resultsDf
#Scaling of independent attributes
from scipy.stats import zscore
XScaled = X.apply(zscore)
kfold = KFold(n_splits=50, shuffle=True, random_state=7)
model = SVC()
results = cross_val_score(model, XScaled, y.values.ravel(), cv=kfold)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))
#Store the accuracy results for each kernel in a dataframe for final comparison
tempResultsDf = pd.DataFrame({'Model': ['SVM - KFold(without PCA)'], 'Accuracy': results.mean()}, index=['3'])
resultsDf = pd.concat([resultsDf, tempResultsDf])
resultsDf = resultsDf[['Model','Accuracy']]
resultsDf
XScaled=X.apply(zscore)
XScaled.head()
# generating the covariance matrix and the eigen values for the PCA analysis
cov_matrix = np.cov(XScaled.T)  # the relevant covariance matrix
print('Covariance Matrix \n%s' % cov_matrix)
#generating the eigen values and the eigen vectors
e_vals, e_vecs = np.linalg.eig(cov_matrix)
print('Eigenvectors \n%s' %e_vecs)
print('\nEigenvalues \n%s' %e_vals)
# the "cumulative variance explained" analysis
tot = sum(e_vals)
var_exp = [( i /tot ) * 100 for i in sorted(e_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
print("Cumulative Variance Explained", cum_var_exp)
# Plotting the variance explained by the principal components and the cumulative variance explained
plt.figure(figsize=(15, 10))
plt.axhline(y=95, color='r', linestyle=':')
plt.bar(range(1, e_vals.size + 1), var_exp, alpha = 0.5, align = 'center', label = 'Individual explained variance')
plt.step(range(1, e_vals.size + 1), cum_var_exp, where='mid', label = 'Cumulative explained variance')
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.legend(loc = 'best')
plt.tight_layout()
plt.show()
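Instead of reading the cut-off from the scree plot, scikit-learn's `PCA` also accepts a variance fraction as `n_components` and picks the number of components itself. A minimal sketch on synthetic data (4 latent directions embedded in 18 noisy features, an illustrative stand-in for the vehicle data):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 4))                    # 4 true directions
X_demo = latent @ rng.normal(size=(4, 18))            # embedded in 18 features
X_demo += rng.normal(scale=0.05, size=X_demo.shape)   # small noise

pca = PCA(n_components=0.95)  # keep enough PCs to explain 95% of the variance
pca.fit(StandardScaler().fit_transform(X_demo))
print(pca.n_components_, pca.explained_variance_ratio_.sum())
```

The same idiom applied to `XScaled` would select the component count at the 95% line drawn on the plot above.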
# Create a new matrix using the n components
X_projected = PCA(n_components=7).fit_transform(XScaled)
X_projected.shape
#Converting PCA Transformed data from Array to Dataframe to visualise in the pairplot
pca_df=pd.DataFrame(X_projected)
sns.pairplot(pca_df, diag_kind = 'kde')
# Divide the projected dataset into train and test split
X_projected_train, X_projected_test, y_train, y_test = train_test_split(X_projected, y, test_size=0.3, random_state=100)
X_projected_train.shape, X_projected_test.shape, y_train.shape, y_test.shape
param_grid = [{'kernel': ['rbf'], 'gamma': [1, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5],
               'C': [0.01, 0.05, 0.5, 1, 10, 25, 50, 100]},
              {'kernel': ['linear'], 'C': [0.001, 0.01, 0.1, 10]}]
grid = GridSearchCV(SVC(), param_grid, refit = True)
# fitting the model for grid search
grid.fit(X_projected_train,y_train)
# print best parameter after tuning
print(grid.best_params_)
# print how our model looks after hyper-parameter tuning
print(grid.best_estimator_)
clf=SVC(kernel='rbf', C=10, gamma=0.1)
clf.fit(X_projected_train, y_train)
pca_prediction = clf.predict(X_projected_test)
# check the accuracy on the training data
print('Accuracy on Training data: ',clf.score(X_projected_train, y_train))
# check the accuracy on the testing data
print('Accuracy on Testing data: ', clf.score(X_projected_test , y_test))
#Calculate the recall value
print('Recall value: ',metrics.recall_score(y_test, pca_prediction, average='macro'))
#Calculate the precision value
print('Precision value: ',metrics.precision_score(y_test, pca_prediction, average='macro'))
print("Classification Report:\n", metrics.classification_report(y_test, pca_prediction))
# Confusion matrix over all three encoded classes, labelled with the original class names
cm = metrics.confusion_matrix(y_test, pca_prediction)
df_cm = pd.DataFrame(cm, index=labelencoder_X.classes_,
                     columns=labelencoder_X.classes_)
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True, fmt="d")
tempResultsDf = pd.DataFrame({'Model': ['SVM(PCA) - Hyper parameter tuned'], 'Accuracy': clf.score(X_projected_test, y_test)}, index=['5'])
resultsDf = pd.concat([resultsDf, tempResultsDf])
resultsDf = resultsDf[['Model','Accuracy']]
resultsDf
kfold = KFold(n_splits=50, shuffle=True, random_state=7)
model = SVC()
results1 = cross_val_score(model, X_projected, y.values.ravel(), cv=kfold)
print("Accuracy: %.3f%% (%.3f%%)" % (results1.mean()*100.0, results1.std()*100.0))
tempResultsDf = pd.DataFrame({'Model': ['SVM - KFold(PCA)'], 'Accuracy': results1.mean()}, index=['6'])
resultsDf = pd.concat([resultsDf, tempResultsDf])
resultsDf = resultsDf[['Model','Accuracy']]
resultsDf
ax = sns.barplot(y="Model", x="Accuracy", data=resultsDf)
for p in ax.patches:
    percentage = '{:.1f}%'.format(100 * p.get_width())
    x_pos = p.get_x() + p.get_width() + 0.02
    y_pos = p.get_y() + p.get_height() / 2
    ax.annotate(percentage, (x_pos, y_pos))
plt.show()
When we applied the Support Vector Classifier on the reduced dimensions we got an accuracy of about 90.5%, while the original dimensions scored better at about 97%. The effect of Principal Component Analysis tends to be more useful on large datasets with many more dimensions.
Multicollinearity and the Curse of Dimensionality are two major phenomena that adversely impact a machine learning model. With a high degree of multicollinearity, a model tends to miss information contained in the mathematical space of the input features. And with the Curse of Dimensionality, because the feature space becomes increasingly sparse as the number of dimensions grows for a fixed-size training dataset, a model tends to overfit.
Principal Component Analysis helps address these problems and can improve model performance considerably.
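One way to see the multicollinearity point concretely: the principal component scores of even strongly collinear inputs are mutually uncorrelated, because the components are orthogonal directions of the covariance matrix. A minimal sketch on two collinear synthetic features:

```python
import numpy as np
from sklearn.decomposition import PCA

# Two strongly collinear features plus a little noise
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
X = np.column_stack([x1, 0.9 * x1 + rng.normal(scale=0.2, size=500)])

scores = PCA().fit_transform(X)
# Off-diagonal covariance of the PC scores is numerically zero:
print(np.cov(scores.T)[0, 1])
```

This decorrelation is exactly why training on principal components sidesteps the multicollinearity among features such as scatter_ratio, scaled_variance and pr.axis_rectangularity noted earlier.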